K-Means 클러스터링

데이터의 종속 변수 값을 예측하는 것이 아니라 독립 변수의 특성만으로 데이터의 그룹 즉, 클러스터(cluster)를 형성하는 작업을 클러스터링(clustering)이라고 한다.

클러스터의 정의는? k-means 갯수는 사용자가 정해줘야 한다. 중심점을 찍어줘야 한다. 클러스터 중심점을 2개 찍는다고 하면 데이터들과 클러스터 중심점과의 거리를 측정해서 가까운 쪽으로 클러스터가 된다.

K-Means

K-Means 클러스터링 알고리즘은 가장 단순하고 빠른 클러스터링 알고리즘의 하나이다.

기본적인 방법은 다음과 같은 목적함수 값이 최소화될 때까지 클러스터의 수 $K$와 각 클러스터의 중심(centroid) $\mu_k$를 반복해서 찾는 것이다.

$$ J = \sum_{k=1}^K \sum_{i \in C_k} d(x_i, \mu_k) $$

이 식에서 $d$는 두 데이터의 유사도 함수(Similarity Function) 혹은 거리(Distance)로 다음과 같이 정의한다.

$$ d(x_i, \mu_k) = || x_i - \mu_k ||^2 $$

세부 알고리즘은 다음과 같다.

임의의 중심값 $\mu_k$ 를 고른다. (보통 데이터 샘플 중의 하나를 선택)
중심에서 각 샘플 데이터까지의 거리를 계산
각 데이터 샘플에서 가장 가까운 중심을 선택하여 클러스터 갱신
다시 만들어진 클러스터에 대해 중심을 다시 계산하고 1 ~ 4를 반복한다. (이게 중요. 중심점이 바뀐다.)



In [1]:

    
X = np.array([[7, 5],[5, 7],[7, 7],[4, 4],[4, 6],[1, 4],[0, 0],[2, 2],[8, 7],[6, 8],[5, 5],[3, 7]], dtype=float)
plt.scatter(X[:,0], X[:,1], s=100)
plt.show()



In [2]:

    
from sklearn.cluster import KMeans
model = KMeans(n_clusters=2, init="random", n_init=1, max_iter=1, random_state=1).fit(X)



In [3]:

    
c0, c1 = model.cluster_centers_
print(c0, c1)
plt.scatter(X[model.labels_==0,0], X[model.labels_==0,1], s=100, marker='v', c='r')
plt.scatter(X[model.labels_==1,0], X[model.labels_==1,1], s=100, marker='^', c='b')
plt.scatter(c0[0], c0[1], s=100, c="r")
plt.scatter(c1[0], c1[1], s=100, c="b")
plt.show()









    



[ 6.6  6.8] [ 2.71428571  4.        ]



In [4]:

    
def kmeans_df(c0, c1):
    df = pd.DataFrame(np.hstack([X, 
                                np.linalg.norm(X - c0, axis=1)[:, np.newaxis], 
                                np.linalg.norm(X - c1, axis=1)[:, np.newaxis], 
                                model.labels_[:, np.newaxis]]),
                    columns=["x0", "x1", "d0", "d1", "c"])
    return df

kmeans_df(c0, c1)



In [5]:

    
print(X[model.labels_==0,0].mean(), X[model.labels_==0,1].mean())
print(X[model.labels_==1,0].mean(), X[model.labels_==1,1].mean())









    



6.33333333333 6.5
2.33333333333 3.83333333333



In [6]:

    
model.score(X)









    Out[6]:





-63.004081632653012



In [9]:

    
model = KMeans(n_clusters=2, init="random", n_init=1, max_iter=2, random_state=0).fit(X)
c0, c1 = model.cluster_centers_
print(c0, c1)
plt.scatter(X[model.labels_==0,0], X[model.labels_==0,1], s=100, marker='v', c='r')
plt.scatter(X[model.labels_==1,0], X[model.labels_==1,1], s=100, marker='^', c='b')
plt.scatter(c0[0], c0[1], s=100, c="r")
plt.scatter(c1[0], c1[1], s=100, c="b")
kmeans_df(c0, c1)









    



[ 1.  2.] [ 5.44444444  6.22222222]






    Out[9]:






  
    
      
      x0
      x1
      d0
      d1
      c
    
  
  
    
      0
      7.0
      5.0
      6.708204
      1.978277
      1.0
    
    
      1
      5.0
      7.0
      6.403124
      0.895806
      1.0
    
    
      2
      7.0
      7.0
      7.810250
      1.739164
      1.0
    
    
      3
      4.0
      4.0
      3.605551
      2.650413
      1.0
    
    
      4
      4.0
      6.0
      5.000000
      1.461438
      1.0
    
    
      5
      1.0
      4.0
      2.000000
      4.969040
      0.0
    
    
      6
      0.0
      0.0
      2.236068
      8.267891
      0.0
    
    
      7
      2.0
      2.0
      1.000000
      5.448978
      0.0
    
    
      8
      8.0
      7.0
      8.602325
      2.671292
      1.0
    
    
      9
      6.0
      8.0
      7.810250
      1.862562
      1.0
    
    
      10
      5.0
      5.0
      5.000000
      1.300522
      1.0
    
    
      11
      3.0
      7.0
      5.385165
      2.565199
      1.0



In [10]:

    
model.score(X)









    Out[10]:





-45.777777777777729



In [11]:

    
(np.linalg.norm(X[model.labels_==0] - c0, axis=1)**2).sum() + \
(np.linalg.norm(X[model.labels_==1] - c1, axis=1)**2).sum()









    Out[11]:





45.777777777777779



In [12]:

    
model = KMeans(n_clusters=2, init="random", n_init=1, max_iter=100, random_state=0).fit(X)
c0, c1 = model.cluster_centers_
print(c0, c1)
plt.scatter(X[model.labels_==0,0], X[model.labels_==0,1], s=100, marker='v', c='r')
plt.scatter(X[model.labels_==1,0], X[model.labels_==1,1], s=100, marker='^', c='b')
plt.scatter(c0[0], c0[1], s=100, c="r")
plt.scatter(c1[0], c1[1], s=100, c="b")
kmeans_df(c0, c1)









    



[ 1.  2.] [ 5.44444444  6.22222222]






    Out[12]:






  
    
      
      x0
      x1
      d0
      d1
      c
    
  
  
    
      0
      7.0
      5.0
      6.708204
      1.978277
      1.0
    
    
      1
      5.0
      7.0
      6.403124
      0.895806
      1.0
    
    
      2
      7.0
      7.0
      7.810250
      1.739164
      1.0
    
    
      3
      4.0
      4.0
      3.605551
      2.650413
      1.0
    
    
      4
      4.0
      6.0
      5.000000
      1.461438
      1.0
    
    
      5
      1.0
      4.0
      2.000000
      4.969040
      0.0
    
    
      6
      0.0
      0.0
      2.236068
      8.267891
      0.0
    
    
      7
      2.0
      2.0
      1.000000
      5.448978
      0.0
    
    
      8
      8.0
      7.0
      8.602325
      2.671292
      1.0
    
    
      9
      6.0
      8.0
      7.810250
      1.862562
      1.0
    
    
      10
      5.0
      5.0
      5.000000
      1.300522
      1.0
    
    
      11
      3.0
      7.0
      5.385165
      2.565199
      1.0

K-Means++

K-Means++ 알고리즘은 최초의 중심값을 설정하기 위한 알고리즘이다.

중심값을 저장할 집합 $M$ 준비 (처음에 떨어뜨려 놓는다?)
일단 하나의 중심 $\mu_0$를 랜덤하게 선택하여 $M$에 넣는다.
$M$에 속하지 않는 모든 샘플 $x_i$에 대해 거리 $d(M, x_i)$를 계산. $d(M, x_i)$는 $M$안의 모든 샘플 $\mu_k$에 대해 $d(\mu_k, x_i)$를 계산하여 가장 작은 값 선택
$d(M, x_i)$에 비례한 확률로 다음 중심 $\mu$를 선택.
$K$개의 중심을 선택할 때까지 반복
K-Means 알고리즘 사용

예: Iris



In [13]:

    
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets



In [14]:

    
np.random.seed(5)
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target

estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8)}

fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum)
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    plt.cla()
    est.fit(X)
    labels = est.labels_
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(np.float), s=100, cmap=mpl.cm.jet)
    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    plt.title(name)
    fignum = fignum + 1

plt.show()

예: Digit Image



In [15]:

    
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale

digits = load_digits()
data = scale(digits.data)

def print_digits(images, labels):
    f = plt.figure(figsize=(10,2))
    plt.subplots_adjust(top=1, bottom=0, hspace=0, wspace=0.05)
    i = 0
    while (i < 10 and i < images.shape[0]):
        ax = f.add_subplot(1, 10, i + 1)
        ax.imshow(images[i], cmap=plt.cm.bone)
        ax.grid(False)
        ax.table
        ax.set_title(labels[i])
        ax.xaxis.set_ticks([])
        ax.yaxis.set_ticks([])
        plt.tight_layout()
        i += 1

print_digits(digits.images, range(10))



In [16]:

    
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test, images_train, images_test = \
    train_test_split(data, digits.target, digits.images, test_size=0.25, random_state=42)



In [17]:

    
from sklearn.cluster import KMeans
clf = KMeans(init="k-means++", n_clusters=10, random_state=42)
clf.fit(X_train)
print_digits(images_train, clf.labels_)



In [18]:

    
y_pred = clf.predict(X_test)
def print_cluster(images, y_pred, cluster_number):
    images = images[y_pred == cluster_number]
    y_pred = y_pred[y_pred == cluster_number]
    print_digits(images, y_pred)
for i in range(10):
    print_cluster(images_test, y_pred, i)



In [19]:

    
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)









    Out[19]:





array([[ 0,  0, 43,  0,  0,  0,  0,  0,  0,  0],
       [20,  0,  0,  7,  0,  0,  0, 10,  0,  0],
       [ 5,  0,  0, 31,  0,  0,  0,  1,  1,  0],
       [ 1,  0,  0,  1,  0,  1,  4,  0, 39,  0],
       [ 1, 50,  0,  0,  0,  0,  1,  2,  0,  1],
       [ 1,  0,  0,  0,  1, 41,  0,  0, 16,  0],
       [ 0,  0,  1,  0, 44,  0,  0,  0,  0,  0],
       [ 0,  0,  0,  0,  0,  1, 34,  1,  0,  5],
       [21,  0,  0,  0,  0,  3,  1,  2, 11,  0],
       [ 0,  0,  0,  0,  0,  2,  3,  3, 40,  0]])



In [20]:

    
from sklearn.decomposition import PCA

pca = PCA(n_components=2).fit(X_train)
reduced_X_train = pca.transform(X_train)
clf = KMeans(init="k-means++", n_clusters=10, random_state=42)
clf.fit(reduced_X_train)
centroids = clf.cluster_centers_

xmin, xmax = reduced_X_train[:,0].min(), reduced_X_train[:,0].max()
ymin, ymax = reduced_X_train[:,1].min(), reduced_X_train[:,1].max()
XGrid, YGrid = np.meshgrid(np.arange(xmin, xmax, (xmax-xmin)/1000),
                           np.arange(ymin, ymax, (ymax-ymin)/1000))
ZGrid = np.reshape(clf.predict(np.array([XGrid.ravel(), YGrid.ravel()]).T), XGrid.shape)
cmap = mpl.colors.ListedColormap(sns.color_palette("Set1"))
plt.contourf(XGrid, YGrid, ZGrid, cmap=mpl.cm.Accent)
plt.scatter(reduced_X_train[:, 0], reduced_X_train[:, 1], c=y_train, cmap=mpl.cm.Accent)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='.', s=1000, color='k', alpha=0.6, zorder=10)
plt.show()

http://scikit-learn.org/stable/modules/clustering.html#clustering 클러스터링 잘 안 된 경우

	x0	x1	d0	d1	c
0	7.0	5.0	1.843909	4.400835	0.0
1	5.0	7.0	1.612452	3.771537	0.0
2	7.0	7.0	0.447214	5.231381	0.0
3	4.0	4.0	3.820995	1.285714	1.0
4	4.0	6.0	2.720294	2.377617	1.0
5	1.0	4.0	6.260990	1.714286	1.0
6	0.0	0.0	9.476286	4.833978	1.0
7	2.0	2.0	6.648308	2.123724	1.0
8	8.0	7.0	1.414214	6.077728	0.0
9	6.0	8.0	1.341641	5.176477	0.0
10	5.0	5.0	2.408319	2.494893	0.0
11	3.0	7.0	3.605551	3.013575	1.0

	x0	x1	d0	d1	c
0	7.0	5.0	6.708204	1.978277	1.0
1	5.0	7.0	6.403124	0.895806	1.0
2	7.0	7.0	7.810250	1.739164	1.0
3	4.0	4.0	3.605551	2.650413	1.0
4	4.0	6.0	5.000000	1.461438	1.0
5	1.0	4.0	2.000000	4.969040	0.0
6	0.0	0.0	2.236068	8.267891	0.0
7	2.0	2.0	1.000000	5.448978	0.0
8	8.0	7.0	8.602325	2.671292	1.0
9	6.0	8.0	7.810250	1.862562	1.0
10	5.0	5.0	5.000000	1.300522	1.0
11	3.0	7.0	5.385165	2.565199	1.0